Prosodic modeling for improved speech recognition and understanding

Author

  • Chao Wang
Abstract

The general goal of this thesis is to model the prosodic aspects of speech to improve human-computer dialogue systems. Towards this goal, we investigate a variety of ways of utilizing prosodic information to enhance speech recognition and understanding performance, and address some issues and difficulties in modeling speech prosody during this process. We explore prosodic modeling in two languages, Mandarin Chinese and English, which have very different prosodic characteristics. Chinese is a tonal language, in which intonation is highly constrained by syllable F0 patterns determined by lexical tones. Hence, our strategy is to focus on tone modeling and account for intonational aspects within the context of improving tone models. On the other hand, the acoustic expression of lexical stress in English is obscure and highly influenced by intonation. Thus, we examine the applicability of modeling lexical stress for improved speech recognition, and explore prosodic modeling beyond the lexical level as well.

We first developed a novel continuous pitch detection algorithm (CPDA), which was designed explicitly to promote robustness for telephone speech and prosodic modeling. The algorithm achieved similar performance for studio and telephone speech (4.25% vs. 4.34% in gross error rate). It also has superior performance for both voiced pitch accuracy and Mandarin tone classification accuracy compared with an optimized algorithm in xwaves.

Next, we turned our attention to modeling lexical tones for Mandarin Chinese. We performed empirical studies of Mandarin tone and intonation, focusing on analyzing sources of tonal variations. We demonstrated that tone classification performance can be significantly improved by taking into account F0 declination, phrase boundary, and tone context influences. We explored various ways to incorporate tone model constraints into the SUMMIT speech recognition system. Integration of a simple four-tone model into the first-pass Viterbi search reduced the syllable error rate by 30.2% for a Mandarin digit recognition task, and by 15.9% on the spontaneous utterances in the YINHE domain. However, further improvements by using more refined tone models were not statistically significant.

Leveraging the same mechanisms developed for Mandarin tone modeling, we incorporated lexical stress models into spontaneous speech recognition in the JUPITER weather domain, and achieved a 5.5% reduction in word error rate compared to a state-of-the-art baseline performance. However, our recognition results obtained with a one-class (including all vowels) prosodic model seemed to suggest that the gain was mainly due to the elimination of implausible hypotheses, e.g., preventing vowel/non-vowel or vowel/non-phone confusions, rather than by distinguishing the fine differences among different stress and vowel classes.
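
The abstract reports pitch detection quality as a gross error rate without defining it. The sketch below shows one common way such a figure is computed, and is not necessarily the definition used in the thesis. It assumes frame-aligned F0 contours in Hz, unvoiced frames coded as 0, and a 20% relative tolerance. The function name `gross_error_rate` and the sample contours are hypothetical.

```python
def gross_error_rate(reference_f0, estimated_f0, tolerance=0.2):
    """Percentage of reference-voiced frames with a gross pitch error.

    Assumptions (not taken from the thesis):
    - both contours are frame-aligned lists of F0 values in Hz;
    - a value <= 0 marks an unvoiced frame;
    - a frame is a gross error when the estimate deviates from the
      reference by more than `tolerance` times the reference, so an
      estimate of 0 on a voiced frame also counts as an error.
    """
    if len(reference_f0) != len(estimated_f0):
        raise ValueError("contours must have the same number of frames")

    # Score only frames that the reference marks as voiced.
    voiced = [(r, e) for r, e in zip(reference_f0, estimated_f0) if r > 0]
    if not voiced:
        raise ValueError("reference contains no voiced frames")

    errors = sum(1 for r, e in voiced if abs(e - r) > tolerance * r)
    return 100.0 * errors / len(voiced)


# Made-up contours for illustration only.
reference = [0.0, 200.0, 210.0, 220.0, 0.0, 180.0]
estimated = [0.0, 202.0, 105.0, 221.0, 90.0, 230.0]
rate = gross_error_rate(reference, estimated)
print(f"gross error rate: {rate:.1f}%")
```

In this toy input the third frame is a halving error and the last frame overshoots by more than 20%, so two of the four voiced frames count as gross errors. Whether voicing decision errors are included in the count varies between evaluations, so figures from different studies are only comparable when the definitions match.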

Similar resources

Prosody Modeling for Automatic Speech Recognition and Understanding

This paper summarizes statistical modeling approaches for the use of prosody (the rhythm and melody of speech) in automatic recognition and understanding of speech. We outline effective prosodic feature extraction, model architectures, and techniques to combine prosodic with lexical (word-based) information. We then survey a number of applications of the framework, and give results for automati...

Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition

Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...

A Database for Automatic Persian Speech Emotion Recognition: Collection, Processing and Evaluation

Recent developments in robotics automation have motivated researchers to improve the efficiency of interactive systems by enabling natural man-machine interaction. Since speech is the most popular method of communication, recognizing human emotions from the speech signal has become a challenging research topic known as Speech Emotion Recognition (SER). In this study, we propose a Persian em...

A Frame-Synchronous Prosodic Decoder for Text-Independent Dialog Act Recognition

Dialog act (DA) recognition is an important intermediate task in speech understanding systems. Although past research has demonstrated that prosody can improve the performance of recognizers relying primarily on words, how prosody fares on its own is not well understood. The current work continues an ongoing investigation into settings in which both words and word boundaries are unavailable, wh...

Spontaneous Mandarin Speech Recognition with Disfluencies Detected by Latent Prosodic Modeling (LPM)

In this paper, a new approach for improved spontaneous Mandarin speech recognition using Latent Prosodic Modeling (LPM) for disfluency interruption point (IP) detection is presented. The basic idea is to detect the disfluency interruption points (IPs) prior to recognition, and then to incorporate this information into the recognition process via second-pass rescoring. For accurate dete...

Publication date: 2001